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Abstract 

We present implementations of a fourth-order symplectic integrator on graphic 
processing units for three A^-body models with long-range interactions of general 
interest: the Hamiltonian Mean Field, Ring and two-dimensional self-gravitating 
models. We discuss the algorithms, speedups and errors using one and two GPU 
units. Speedups can be as high as 140 compared to a serial code, and the overall 
relative error in the total energy is of the same order of magnitude as for the 
CPU code. The number of particles used in the tests range from 10,000 to 
50,000,000 depending on the model. 

Keywords: Molecular dynamics; Symplectic integrator; Long-range 
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1. Introduction 

The dynamics of systems with long-range interactions have been intensively 
studied over the last two decades due to their unusual and intriguing phe- 
nomenology, such as the existence of quasi-stationary non-Gaussian states with 
diverging life-times with the number of particles, negative microcanonical heat 
capacity, inequivalence of ensembles and non-ergodicity [U [2j [3j HI [6j [7] . Self- 
gravitating systems [B]. non- neutral plasmas [91 HO] and models as the Ring 
model [TTJ [TJ] Hamiltonian Mean Field (HMF) [TJ], one-dimensional gravity 
(infinite uniform density sheets) [21 [T5J [TB] , two-dimensional gravity (infinite 
uniform rods) [T7] , Free Electron Laser [TH] and plasma single wave models [T5] 
are among many examples of systems with long-range forces. A pair potential 
interaction has a long-range if it scales for large distances as r~ a with a < d, 
r the intcrparticlc distance and d the spatial dimension. This slow decaying 
interparticle potential is responsible for the coupling of distant components of 
the system in such a way that all particles contribute to the dynamics of a 
given particle. For out of equilibrium situations, many of these studies rely 
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on molecular dynamics simulations, i. e. solving numerically the Hamiltonian 
equations of motion for the iV-particle system. It is also a well known fact that, 
under suitable conditions, the statistical description of the dynamics of long 
range interacting systems is equivalent to the Vlasov equation in the N — > oo 
limit [UI15]. 

Here we present a CUDA implementation [5T] of Molecular Dynamics (MD) 
algorithms on Graphics Processing Units (GPUs) to solve the Hamiltonian equa- 
tions of motion for such systems, using one and two GPUS. The algorithms used 
can be efficiently extended for any number of GPUs. Molecular Dynamics sim- 
ulations have been extensively used to study many properties of these systems 
from first principles (see for instance Reference [T| an( i references therein). Some 
numerical parallel algorithms were implemented in the literature with applica- 
tions ranging from condensed matter to astrophysics [22] [23] [24] [25] [26] [27] . 
Although MD simulations are widely used in the study of long-range systems, 
only recently a GPU code of a symplectic integrator was implemented and used 
by the author and collaborators alongside a GPU Vlasov solver to study non- 
equilibrium phase transitions in the HMF model [55] HE I3H] ■ In such studies 
simulations are performed without introducing any simplifying hypothesis such 
as the Particle-Mesh technique [31] or similar methods commonly used in con- 
densed matter physics [55]) with a full N-body method with a computational 
effort scaling as N 2 . For some models the explicit form of the pair interaction 
potential allows further simplifications with important reduction in computa- 
tional time. 

The structure of the paper is the following: in section [2] we describe the 
model systems that are studied in this paper. Section [3] presents the main 
algorithms and how they are implemented in CUDA, and in section[4]we present 
the timings and speedups for the three models taken as examples, and discuss 
how the relative error in the energy behave for each of them. We close the paper 
with some concluding remarks in section [5] 



2. Models with long-range interactions 

Here we present some simplified models that are discussed in the present 
work. The Ring model is composed of N particles with unit mass on a ring 
of radius R and interacting though a regularized gravitational potential with 
Hamiltonian [TT1[T2] : 

1 N 1 N 1 

2—1 l<^J — \ v V \ 1 J ' 

with e a softening parameter introduced to avoid the divergence at zero distance 
and 9i denotes the position angle of a particle on the circle. By considering the 
limit for large values of the parameter e we obtain the Hamiltonian Mean Field 
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(HMF) model with Hamiltonian [13]: 

N N 

H =2^ P * + N ? [l-cos^-^)]. (2) 

i—l i<j — l 

The two-dimensional self-gravitating particles is composed of identical par- 
ticles with unit mass. By solving the Poisson equation in two dimensions we 
obtain the Hamiltonian [TT] : 

1 N 1 N 

H =2J2(Pli +p lJ + N ? ^[{xi-Xjf + iyi-yjf + e], (3) 

2=1 i<j=l 

where p x ,i> Py,i, &i and are the x and y components of the momentum and the 
coordinates for the z-th particle, respectively, and e is again a (small) parameter 
used to avoid the divergence of the potential at zero distance. 

3. Algorithms and CUDA implementation 

The standard integration method for the numerical solution of Hamilton 
equations for the model systems described in the previous section is a fourth- 
order symplectic integrator. Here we adopt the Yoshida symplectic integrator 
that approximates the evolution operator as |33] : 

e Ml H _ e B AtL v e D AtL K e B t AtL v 

xe D 1 AtL K e B x AtLv e D AtL K e B AtL v _|_ Q (At 5 ) (4) 

where At is the integration step and B , B\, D and D\ are numerical constants 
obtained in Ref. [33] , Lh = {H, }, Ly = {V, } and Lk = {K, } with K and V 
the kinetic and potential energies respectively and {H, } stands for the Poisson 
bracket of H with any function it operates on. The approximation in eq. Q is 
time reversible. A whole integration step is given by the following steps where 
f (r) stand for the force array at positions r 1; . . . , r^: 

1. fO = f (r(t)), 

2- P (J) = p(t)+B AtfW, 

3. r« =r(t) + D Atp( / ), 

4. f(»)=f(rW), 

5. =p(- r )+B 1 Atf("\ 

6. rW=r( / ) + J D 1 Atp^) ) 

7. fC") = fXrC')), 

8. p( /7/ ) =p( // )+S 1 Atf( /// ), 

9. r(t + At) = r<"5 + D At p( IU \ 

10. f(t + At) = f(r(t + At)), 

11. p(t + At) = p( m ) + B At fW, 

12. Compute any properties of interest, then go to step (2) with f( 7/ ) = f(t + 
At). 

Each time step requires three force calculations, being the most demanding step. 
We now present the specific CUDA implementation for each model in section [2] 
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3.1. HMF model 

From the Hamiltonian in eq. ^ the potential energy and the force on particle 
i can be written respectively as 

V = f [1 - Ml M 2 ] , (5) 

and 

fi = cos^M, - sm(6i)M x , (6) 
where the components of the "magnetization" are given by 

N N 

M * = jj E cos (^)' M y = n E sin (^)- ( ? ) 

i=l i=l 

We note that due to the form of the interaction potential the symplectic inte- 
gration time for the HMF scales with N in contrast to the usual N 2 scaling 
for the other two models below. The most demanding part of an integration 
step is the computation of the sine and cosine of the position angles for each 
particle. For the force in cq. ([6| each cos(6 l i) and sin(0,*) is used twice: to com- 
pute M x and M y and to obtain the force on each particle from expression ([6]). 
To avoid redundant computations their values are first computed and stored in 
an array in the GPU global memory taking care to ensure coalesced memory 
access, which is an important issue in any CUDA implementation. A CUDA 
kernel is composed of blocks each with a given number of threads. Each thread 
in a block computes the value of the cosine and sine of a given particle and the 
corresponding values are also stored in shared memory. A reduction procedure 
is then used to compute the values of M x and M y and the force is obtained 
using the precomputed values of sin(0j) and cos(0j). The potential energy is 
trivially obtained from eq. ([5]) and the kinetic energy is efficiently computed 
using a straightforward reduction procedure. 

For two GPUs half of particle data (position, momenta and force) is stored 
in each GPU which then computes the magnetization components for its corre- 
sponding number of particles, and their sum give the total magnetization com- 
ponents. The forces on each particle are the trivially computed for the particles 
on each GPU, and the subsequent evolution is also performed independently by 
each GPU for its set of particles. 

3.2. Ring model and self-gravitating 2D system 

The computational time for symplectic integration of the Ring model scales 
as N 2 . Here we follow a strategy similar to the one described in Ref. [34] based 
on a decomposition of the force calculation in tiles as depicted in Figure [l] The 
force on particle i is obtained from Hamiltonian ([I]) as: 



> ' - 2. /V. *> " 2 ^ 2N (1 _ cos( ^ _ 9j) + e) 3/2 ■ 
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^ Tile 



Figure 1: Decomposition in tiles for the computation of the force components /; for the Ring 
and self-gravitating models. 

This last expression can be rewritten as 

1 sin 6i cos 8j — cos 6>,; sin Oj 

' 13 ~ 2y/2N (1 - cos 9 t cos j - sin 9, sin 9j + e) 3/2 ' 

Each sin 9i and cos 0i for i = 0, . . . , N is computed and stored in global memory 
in order to avoid computing twice their values. Each tile has iVthreads (the 
number of threads in a block) particles in the horizontal direction (index i in 
Fig.[lJ and iVthreads m the vertical direction (index j). The total number of tiles 
in each direction is thus Nuiea — iV/iVthreads- The algorithm can be expressed 
as: 

1. Store fk — and the values of sin6> fc and cos9 k , k — I N thl . cads , ...,(/ + 
1) -<Vthreads~ 1 in shared memory, with / the block number, and synchronize 
threads in the block. 

2. m = 0; 

3. Store sinf?fc and cos 6^, k = m iVthreads > ■ ■ - ,(m + l) iV t h re ads — 1 in shared 
memory and synchronize threads in the block; 

4. for k = m iVthreads, • • • j {m+ 1) iVthreads — 1 compute fik and sum its values 
to fn 

5- j = j + 1 and goto step (3) while j < N tiles . 

Newton's third law is not used here as its implementation in a CUDA kernel 
would introduce unnecessary complications with no significant speed gain. The 
computation of is performed using the CUDA function rsqrt(x) with sig- 

nificant gain in computing time and no significant loss in accuracy. 

For the two-dimensional self-gravitating system the approach is essentially 
the same as for the Ring model described above, except for the number of com- 
ponents for each particle (two) and the interpaticle force. In this case, a signifi- 
cant gain in speed without significantly compromising accuracy consists in using 
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double precision for the storage of all data in global memory but storing each 
component of the position in single precision shared memory before computing 
the force on each particle, which allows to load more blocks concomitantly on 
the GPU, and therefore augmenting occupancy. The force between two given 
particles is then computed in single precision but added in a double precision 
variable to determine the total force on a given particle. All remaining compu- 
tations are performed in double precision. For determining the timings and for 
comparison purposes the same is done for the CPU code. 

For two CPUs, each one computes separately the forces for half of the par- 
ticles, which contrary to the HMF model requires the positions for all particles. 
Therefore each GPU stores in global memory half on the momenta, half of the 
forces but all the particle position coordinates. As for the HMF model the time 
evolution is performed independently on each GPU for its respective set of par- 
ticles, and only half of the positions on each GPU is up to date after a time step. 
To compute the force we first copy half of the updated values from one GPU to 
the other using an asynchronous memory copy. Then each GPU computes half 
of the forces with iVtiies/2 tile in the horizontal and iVtiies tiles in the vertical 
directions. The whole computation is synchronized after all forces are evaluated 
using stream synchronization, as shown in Fig. [2] A single partial integration 
step composed of a force calculation and free drift with constant momentum 
and a force increment with constant position (steps 1, 2 and 3 in section [3] for 
instance) can be summarized for both the Ring and 2D self-gravitating model 
as (each GPU has a stream defined for it): 

1. Asynchronous copy of the position components of particles 1, . . . , N/2 
from GPU to GPU 1. 

2. Asynchronous copy of the position components of particles N/2 + 1, . . . , N 
from GPU 1 to GPU 0. 

3. Stream synchronization for each GPU. 

4. Launch a kernel to compute the components of fj, i = 1, . . . , N/2. 

5. Launch a kernel to compute the components of fj, i = N/2 + 1, . . . , N. 

6. Stream synchronization for each GPU. 

7. Launch a kernel to update halt of the momenta and half of the position 
coordinates on GPU 0. 

8. Launch a kernel to update halt of the momenta and half of the position 
coordinates on GPU 1. 

9. Stream synchronization for each GPU. 

The asynchronous copy between GPUs in steps (1) and (2) above is efficiently 
handled by the CUDA routine cudaMemcpyPeerAsync. 

4. Results and discussion 

The computer used for the simulations is an i7-2600/ 3.40 GHz and 16 GB 
of RAM. The GPU is a GeForce GTX 690 dual with a Kepler Architecture, 
2048 MBytes of memory and 1536 CUDA Cores for each GPU. Results are 
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N tiles 
► i 2 



GPU 1 



Positions update 
< ► 



N tiles 
-> i 2 



GPU 2 



N 



tiles 



Streams synchronization 



Figure 2: Using two GPUs to compute the forces for the Ring model and 2D self-gravitating 
system. 
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presented using one and both units Tables [T] [2] and [3] show the timings for a 
complete time step for the HMF, Ring and 2D gravity models, respectively. 
All CPU implementations were optimized on a single CPU core but parallel 
implementations on many cores on the CPU are also possible with a maximum 
speedup given by the number of cores. The speedups obtained range from 23 
to 73 for the HMF model, 104 to 141 for the Ring model and 53 to 100 for 
the 2D self-gravitating system. The greater the number of particles the greater 
the speedup due to a higher occupancy of the GPU cores. For the same reason 
using more than one GPU becomes more advantageous for a large number of 
particles. It is interesting to note that the smaller speedups were obtained for 
the HMF model. A similar result is obtained in Ref. 30J for the solution of 
the Vlasov equation and is a consequence of the scaling property with N of the 
computational time and the fact the CPU code being already highly optimized. 

To assess the accuracy of the present approach, we consider all three systems 
and compare the relative error in the energy e = \ {E — Eq)/Eq\ for the different 
implementations. For the sake of comparisons, the error for the CPU case is 
obtained using double precision in all steps. For all three system we consider an 
initial state with all particles initially at rest. For the HMF and Ring models 
the initial state is spatially homogeneous. Figures [3] and [4] show the plot of the 
kinetic and potential energies and the corresponding error e for a time window 
large enough to encompass the initial violent relaxation of the system which is 
more pronounced for the HMF model. For the latter, the error for the CPU 
and GPU implementation are indistinguishable in the plot. For the Ring model 
both cases are always very close. Thence computing the force between the two 
particles using single precision arithmetic do not compromise the overall accu- 
racy. For the two-dimensional self-gravitating system, we consider all particles 
initially at rest and homogeneously distributed on a circular shell of inner and 
outer radius R\ = 1.0 and i? 2 = 1.5. The system then undergoes a violent 
relaxation towards a quasi-stationary state with some damping oscillations as 
show in Fig. [5j Figure [6] shows the plot of the error for a CPU implementation 
(in FORTRAN) using double precision for all steps, and the error for the GPU 
implementation also in full double precision, and GPU implementation using 
single precision for the computation of forces between pairs of particles, as de- 
scribed above yielding the same order of magnitude for the error. The error 
grows as the particles get closer during the violent relaxation with a threshold 
at the same values for the two GPU implementations. 

5. Conclusions 

We presented implementations in one and two GPUs of a forth order sym- 
plectic integrator for three different systems with long-range interactions. Those 
systems have been extensively studied in the literature and the present imple- 
mentations allowed a more extensive investigation of properties such as the na- 
ture of non-equilibrium phase transitions and non-ergodic behavior in the HMF 
model [29l [35] . Shared memory is used to allow faster memory access crucial in 
GPU implementations. Since it is a very limited resource using single precision 
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N 


CPU 


Single 


Dual 


10 5 


0.014 


6.3xl0~ 4 


6.2x10" 


4 


10 6 


0.14 


4.0xl0~ 3 


2.2x10" 


3 


10 7 


1.40 


3.7xl0~ 2 


2.0x10" 


2 


5 x 10 7 


6.90 


1.81X10" 1 


9.5x10" 


2 



Table 1: Run times in seconds for a single complete time step for the HMF model with N 
particles for CPU, and the GPU used as single and dual units. 



N 


CPU 


Single 


Dual 


Tile size 


10, 240 


2.8 


0.04 


0.027 


256 


51,200 


65.0 


0.95 


0.54 


512 


102,400 


250.4 


3.9 


1.98 


512 


512,000 


6,726.9 


97.5 


47.8 


1024 



Table 2: Run times in seconds per time step for the Ring model with N particles . 



N 


CPU 


Single 


Dual 


Tile size 


10,240 


0.58 


0.023 


0.011 


256 


20,480 


2.48 


0.047 


0.025 


256 


102,400 


57.6 


0.99 


0.58 


512 


512,000 


1,491.9 


26.8 


17.42 


512 


1,024,000 


6,455.9 


106.5 


64.6 


1024 



Table 3: Run times per time step in seconds for the 2D self-gravitating system. 




t t 



Figure 3: Left panel: Kinetic and potential energy as a function of time for the HMF model. 
Right panel: Relative error for the CUDA implementation for the HMF model. The error for 
the CPU versions is virtually the same and would be indistinguishable on the graphic. The 
simulation data are e = 10~ 3 and At = 0.1 for a homogeneous initial condition with particles 
at rest. 
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4 "~ 6 

t 



Figure 4: Left panel: Kinetic and potential energy as a function of time for the Ring model 
with e = 10 -3 , At = 0.005. Right panel: Error for the CPU implementation in full double 
precision and the GPU implementation. 




t 



Figure 5: Kinetic and potential energy as a function of time for the two-dimensional self- 
gravitating system, with e = 10 -3 , At = 0.005. 




t 



Figure 6: Relative error in the total energy for the self-gravitating system for the CPU and 
GPU implementations. For comparison purposes all computations on the CPU implementa- 
tion were performed in double precision. The right panel is a zoom over the region comprising 
the violent relaxation stage. 
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floating point storage optimizes its use. Some steps of the computation, such as 
the force between pairs of particles, or the inverse of the interparticle distance, 
can be performed in single floating point precision without compromising the 
overall error in the simulation. The speedups obtained range typically from 30 
to 140 depending on the system and the number of particles, simulations that 
would be unfeasible in any reasonable time using a typical CPU. The gener- 
alization of the present approach to multi-GPU (more than two) and other N 
particle systems is straightforward. 
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